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Abstract 

The model evidence is a vital quantity in the comparison of statistical models under 
the Bayesian paradigm. This papers presents a review of commonly used methods. 
We outline some guidelines and offer some practical advice. The reviewed methods 
are compared for two examples; non-nested Gaussian linear regression and covariate 
subset selection in logistic regression. 

Keywords and Phrases: evidence; marginal likelihood; Markov chain Monte 
Carlo; harmonic mean estimator; power posteriors; annealed importance sampling; 
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1 Introduction 



The advent of Markov chain Monte Carlo methods, since Geman and Geman (1984) and 



especially Gelfand and Smith ( 1990 ) have led to what might be termed a Bayesian revolution. 



It is now routine to carry out a Bayesian analysis for quite complex models with often 
high dimensional data. In particular, WinBUGS (Spiegelhalter et al 2003) has allowed the 
0\ practitioner access to many types of statistical models which are amenable to Gibbs sampling. 

^ ! This software has played a major role in the popularisation of Bayesian analysis to a very 

^ wide audience. 

I More frequently, the practitioner is now not just interested in analysing data assumed 

^ from a single statistical model, but rather from the starting point where one assumes that 

^ there are a collection of models which could have plausibly generated the data. The usual 

^ objective is to understand the uncertainty associated with each statistical model, or indeed 

to use this model uncertainty to aid prediction of, for example, future observations. The 
Bayesian paradigm offers a principled approach to the issue of model choice, through exam- 
ination of the model evidence, namely the probability of the data given the model. Suppose 
we are given data y and assume there are a collection of competing models, mi, . . . ,mi, 
each with associated parameters, 9i,. . . ,9i, respectively. Viewing the model indicators as 
parameters with prior distribution iT{mk), the posterior distribution of interest is 

7r(6'fe,mfe|y) oc 7T{y\9k,mk)n{9k\mk)7T{mk) 
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where n{y\6k,mk) is the hkehhood of the data under model with parameters 6k and 



7r(6'fc|mfc) is the prior on the parameters in model m^. The seminal paper by Green (1995) 
which introduced the reversible jump Markov chain Monte Carlo algorithm, allows sampling 
of this joint model and parameter space and is itself a topic in this issue. This paper focuses 
on the complimentary approach whereby one considers the posterior distribution conditional 
on model m^, 

7r(6'fc|y,mfc) oc 7i{y\9k,mk)7i{9k\mk). 
The constant of proportionality for the un-normalised posterior distribution above is 

7T{y\mk) = / 'K{y\9k,mk)-K{9k\mk)dek. 



This is a vital quantity in Bayesian model choice and, somewhat confusingly is given differ- 
ent names in the literature, the marginal likelihood, integrated likelihood or evidence. The 
terminology, evidence is the perhaps the most appealing and we will use this throughout this 
article. In general, for almost all statistical models the evidence is analytically intractable, 
since this involves a high- dimensional integration over a usually complicated and highly 
variable function. Computational tractability is rarely possible, and in such cases, relies on 
conjugate priors. An interesting and recent example where conjugacy is used to this effect 



is in the paper by Sabanes and Held (2011), in the context of generalized linear models. 
Additionally, the evidence is very sensitive to the prior, so that small changes in the prior 
specification can lead to large changes in the evidence. 

This article is solely concerned with reviewing computational methods to estimate the 
model evidence while the alternative approach to consider the posterior distribution of pa- 
rameters and model indicators will be dealt with in an article by Hastie and Green in this 
issue. Here we present some more recent developments in this area. This article is aimed 
at readers not necessarily familiar with Bayesian methods. The intention is to provide a 
starting point from which interested readers can delve further. 



2 Why do we need the evidence? 

Assuming that we could compute 7r(?/|mfc), then Bayes theorem can be used to combine the 
collection of evidence terms for each model to form posterior model probabilities, 

7r(mfc|?/) 



Ej=i7r(|/|mj)7r(mj) 

Moreover, if interest surrounded the comparison of two competing models, a natural quantity 
of interest is the Bayes factor which is nothing more than the ratio of evidence terms for 
both models, 

'n{y\mi) 
■n[y\mj) 
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Indeed the ratio of posterior model probabilities can be expressed, using ([T]) as 

7iimi\y) 7i{y\mi 



rami 



X 



7r{mj\y) 7r{y\mj) 7r(m 



The larger BFij is, the greater the evidence in favour of rrii compared to rrij. Jeffreys (1961) 



presents a scale in which one can interpret the strength of evidence of one model against 



another. The reader is also referred to Kass and Raftery (1995 ) who present a comprehensive 
review of Bayes factors. 



Bayesian model averaging (Hoeting et al 1999) offers a very natural way to make predic 



tions by averaging over all models, weighted proportional to their posterior model probabili- 
ties, thereby incorporating model uncertainty. For example, if one were to make a prediction 
for a future observation (or indeed a collection of observations) y* , then a natural quantity 
of interest is 



'^i.y*\y) = X]7r(y*|mfc,?/)7r(mfc|?/). 



k=l 



This is the average of the posterior distribution for y* under each model weighted by the 
corresponding posterior model probabilities. 



3 Review of methods to compute the model evidence 

Here we present, in chronological order, a short summary of various approaches to compute 
Ti{y\mk)i the evidence for model m^. In the remainder of this section we assume that we 
are estimating the evidence within a given model so that we suppress conditioning on the 
model TTik and the subscript indexing the model on the parameters 9. Subscripts on 9 will 
be defined for the given context. 



3.1 Laplace's method 



An early and widely used method is Laplace's method (Tierney and Kadane 1986). This 
approach makes the assumption that the posterior distribution can be adequately approxi- 
mated by a normal distribution, for example if the sample size is large enough. Obviously 
this is a very strong assumption which might not often hold. Specifically, assume that 7r(^|y) 
is highly peaked around the posterior mode 9. Define 



m = \og{7^{y\9)n{9)}. 

Expanding l{9) as a quadratic about 9 and exponentiating leads to approximating 'ir{y\9)'K{9) 
as a Gaussian with mean 9 and covariance S = {—D'^l{9))'^, where D'^l{9) is the Hessian 
matrix of second derivatives. Integrating this approximation yields 

niy) ^ {27rY/'\t\'/My\0M9). (2) 



3 



3.1.1 Running example: Gaussian model 



As a means to demonstrating how the different approaches to estimating the evidence are 
applied in practice we adopt a simple running example which will be revisited after each 
approach is introduced. We consider a sample of size n independently distributed according 
to a A^(/i, T~^) distribution. The priors on ^ and r are assumed A^(^, z/~^) and Ga(ao/2, &o/2) 
respectively. Here 6 = {fi,Ty. Thus, 



/ r \ "/2 j r sr^ 



and (ignoring unnecessary constants) 
'n + ao 



m 



The gradient is given by 



1 loe 



r 
2 



vi{e) = 

and the Hessian is 



DH{e) 



- Er=i Vi ( 



1) /7 



It is possible to find the mode iteratively using a Newton method. Starting with a guess 
one iterates 

Qiw) ^ Qit) _ ^ [^2;(^W)] -1 v/(^W), t = 0, 1, . . . 

until some convergence criterion is satisfied (e.g. |/(^(*+i)) - /(^W)] < lO^^). The parameter 
7 in the iterations is a step-size which is usually chosen to be equal to 1. An estimate of 
TT{y) may then be obtained by taking 9 as the last iterate and evaluating nM. 



3.2 Harmonic mean estimator 



The harmonic mean estimator (Newton and Raftery 1994) is an easy to implement estimator 
of the evidence, based on draws from the target distribution. Suppose 0'^^\ . . . , 6^^'^ is a sam- 
ple generated from the posterior vr(6'|y) using MCMC Then after computing the likelihood 
of each 6'*^*\ 7r(j/|6'*^*^), i = 1,. . . , A^, the evidence is estimated as the harmonic mean of the 
likehhood, 

/i ^ 1 \ 

(3) 
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This estimator is easily shown to hold since it follows from the identity 



There is a severe downside to this estimator. Equation (|3j) is based solely on draws from the 
posterior. But the posterior is typically much more peaked than the prior, e.g., when the 
posterior is insensitive to the prior. Hence in such situations, the harmonic mean estimator 
will not change much as the prior changes. However it is well known that 7r{y) is very 



sensitive to changes in the prior. This drawback is very well documented, see (Robert and 



Wraith 2009), for example. 



3.2.1 Running example: Gaussian model 

This model may be sampled from using a Gibbs sampler. The sample 9^^\ . . . , 9^^^ is drawn 
iteratively by drawing ^(*+^) = (^n^'^~^^\ r^^'^^'^Y conditioning on 9^'^^: 



(4) 



r(^+i) ~ Ga I \ 



t=i 



The final sample 9^^\ . . . , 6'*^^^ is obtained by throwing away some number of burn-in itera- 
tions at the beginning of the chain. 

To compute the harmonic mean estimator we calculate log 7r(y|^*^*^), i = 1,...,A^ and 
note that ([s]) may be rewritten 

/ ^ . \ 

log7r(?/) ^ log — log j exp{— m} ^^exp{— log 7r(?/|^^^*'') + m} j 

where m = max{log7r(y|6"^*)),i = 1,...,N}. This approach to evaluating the harmonic 
mean should be more numerically stable than a direct implementation in cases where the 
log likelihoods are large negative numbers. 

3.2.2 A tractable example 

To illustrate the sensitivity of this estimator to changes in the prior consider the following 
simple example. Suppose that data yi ~ A^(/i, r~^), for i = 1, . . . , n. Let's suppose that 
conjugate priors for fj, and r follow a iV(/io, l/(ror)) and Ga{ao, bo) distribution, respectively. 



This defines a so-called normal-gamma model (Bernardo and Smith 1994), for which the 
full-conditional distributions are 



/i 



r ~ A^(/i„,r„ ^) and r|/i ~ G'a(a„, 6„;, 
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where 



lJ^n= ^ ; r„ = ro + n; = + ri/2; hn = bQ + - } [yi - y) +^77 ^ — 

To + n 2 ^ 2(ro + n) 

This model is also one for which the evidence can be computed analytically, 

Here we simulated n = 100 observations, with /i = and r = 1. Prior parameters for 
Qq, bo and /zq were set to 0.001, 0.001 and 0, respectively. Our interest was to see how 
the harmonic mean estimator performed for varying values of tq. In Table [T| the harmonic 
mean estimate of the logarithm of the evidence, log7r(?/), based on 10^ iterations of a Gibbs 
sampler, is presented for varying values of tq. Clearly, these estimates do not vary much 
with To, in contrast to the true value of log7r(?/) which varies as the prior for fi changes. 





0.0001 


0.01 


0.1 


1 


lo^ 




-160.54 


-158.24 


-157.09 


-155.95 


lo^ 




-148.33 


-148.45 


-148.59 


-148.85 



Table 1: The true value of log7r(y) is presented for various values of tq, together with 
estimates of the logarithm of TT{y), based on the harmonic mean estimator. 



As above, the explanation for the poor performance of the harmonic mean estimator lies 
in the fact that the posterior distribution does not change very much, as the prior distribution 
for /i becomes more diffuse, from tq = 1 through to tq = 0.0001. Therefore, samples from 
the four different posterior distributions in this example will be very similar, as reflected in 
the similar harmonic mean estimates. However, the evidence does depend on how diffuse 
the prior for /i is, since it involves an integration over /i (and r). 



3.3 Chib's method 



Chib (|1995|) presented a generic method which can be applied to output from the Gibbs 

Ti re-£ 



sampler. This estimator follows from re-arranging Bayes' formula to yield 

niy\9)ni9) 



nie\y) • 

Using this identity one could estimate log7r(?/) as 

log7r(y) = \ogn{y\e*) + log7r(r) - log7r(r |y) (5) 



where, for example, Tf{6*\y) is an estimate of the posterior density at a point 6* of high 
posterior probability, although the equality holds for any 6*. Examining the right hand 
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side of ([5]), both the prior and hkehhood terms, ti{9*) and 7r{y\9*), respectively, can usually 
be evaluated in closed form, for most statistical models. The estimation of the posterior 
probability, 7r{d*\y), is troublesome. The important contribution of Chib was to realise that 
7T{6*\y) can be estimated by a Monte Carlo average based on draws from the Gibbs sampler. 

Suppose the vector 9 can be partitioned as (^1,^25^3)5 where the full-conditional distri- 
bution of each 6i is of standard form. The law of total probability allows one to write 

n{e*\y) = n{ei\e;,eiy)ir{e;\eiy)ir{ei\y) 

Gibbs sampling can be used to estimate each factor on the left hand side. 



TT 



1 ^ 



N 

i=l 



1 ^ 

■t=l 

Clearly, this approach can be extended to high-dimensional parameter spaces. Chib's method 
can obviously be used in many situations, given the wide applicability of the Gibbs sampler 
for a very wide class of statistical models. Certainly, this is the case - it is clear from 
the number of citations which this article has gained that this approach is popular among 
practitioners. In terms of computational implementation, a requirement is that draws from 
various full-conditional distributions need to be stored, although this is rarely prohibitive. 



Note also that this method has been extended (Chib and Jeliazkov 2001 ) so that the evidence 



can be estimated based on output from a Metropolis-Hasting sampler. Finally, it should be 
mentioned that difficulties can arise when this method is applied to mixture models, hidden 
Markov models and other models which give rise to label switching and parameter non- 
identifiability. The difficultly can arise because for a finite mixture model with k components, 
for example, it can take a long time, in practice, for a Gibbs sampler to visit all the unique k\ 
modes. See www. cs .utoronto . ca/°/o7Eradf ord/f tp/chib-letter .pdf for a more detailed 
discussion of this issue. 



3.3.1 Running example: Gaussian model 

The parameter 9 may be partitioned in this case into (/i, t). Draw a sample 9^^\ . . . ,9^^^ 
using the Gibbs sampling scheme Q. We will first evaluate 7r{T*\y). This is done by 
computing 

AT / 

n + aQ 1 
2 ' 2 



TT r 



Is) 



t = l 



(i)N2 



+ &0 



where Ga{x\a, b) represents the gamma density function with shape a and rate b evaluated at 
X. This should be maximized over the range of r for a good approximation to the posterior 
ordinate. For example, one could compute 7r{T^^'\y) for each r(j') in the sample, and then 
take T* = arg max^g|^(i)^ ^^{iv)|7r(r|?/). 
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Equipped with r* we can compute the mode of the full conditional of /i|r*. This is more 
straightforward since /i|r* is normally distributed with mode at ^* = (r* Ym=i Vt + ^0 I {nT*^ 
v) so that 7r(/i*|r*,?/) = {tlt* + vfl'^j^f^. Denoting Q* = {n*,T*y, 

log 7r(r \y) = ^ log(nr* + u) - ^ log(27r) + log 7r(r* \y) 
which can be used to obtain (Isl) with 9*. 



3.4 Annealed importance sampling 

Importance sampling plays a vital role in Monte Carlo sampling. Suppose that the target 
distribution of interest is vr(x), which is not amenable to direct sampling, but that tt{x) is an 
approximation of it, from which one can sample easily referred to as the importance function. 
Here x is used to generically denote some random variable and 7r(-) its density, most likely 
up to an unknown constant of proportionality. Then draws x^^\ . . . ,x^^^ from ti{x) can be 
used as an importance sampling estimate of the expectation of some vr-integrable function 
a(-) since 



/f 7t(x^ 1 ^ 

a{x)TT{x) dx = a{x)——fc{x) dx = E^w(x)a(a;) ~ — 
' i=l 



w(x^'^)a(x^'^] 



where the importance weights are defined as 

7r(x) 



w{x) 



TCiX] 



once 7r(-) and 7r(-) are normalized. 

When one or both of 7r(-) and 7r(-) are not normalized, the ratio of their normalizing 
constants z^/zji (with z.„ = n{x) dx and Zi, = 7r(x) dx) may be approximated by noting 

z \ f \ f Trfx) f 7r(x) 

— = — / 7r(x) dx = — / — — 7r(x) dx = w{x) dx = 'E^w{x) 

^TT J ^TT J 7T"IXI J Z^ 



so that 



1 ^ 

ij:Mx«)^| (6) 

i=l 



as n — 7- oo. Then in the case of un-normalized densities the importance sampling estimate 
above is given by. 



E^a(x) 



An important issue is choice of the importance function, which generally needs to have 
heavier tails than the target distribution. Annealed importance sampling (AIS), developed by 
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Neal (2001) is a very clever algorithm which, broadly speaking, uses a tempering mechanism 
to adaptively define an importance sampling function to approximate the target posterior 
distribution. This is done by defining 



7rt,{e\y) = n{ey-'^n{9\yY\ where 1 = to > ■ ■ ■ > = 0. 

where 7r(0|y)*-' is an un-normalised density composed by raising the (un-normalised) posterior 
to power tj. Then {vrtg, . . . , TTt^} defines a sequence of distributions, transitioning from the 
prior distribution to the posterior distribution. The innovative aspect of this algorithm is 
to use a Markov transition kernel, for example, a Metropolis-Hastings kernel, to allow one 
to draw from an importance function which can be used to approximate 7T{6\y), the target 
distribution. 

Let Tj denote a Markov transition kernel with invariant nt-, for example, one based on 
Metropolis-Hastings updates. The AIS algorithm can now be sketched as follows. 

FOR i = 1,. . . ,N DO: 

SAMPLE 9m-i FROM TT^^ . 

SAMPLE 9m-2 FROM 6m-l USING T^-i 

SAMPLE 00 FROM 9i USING Ti 
SET 

^« = 00 AND W(9^'^) = ^t^^2iOm-2\y) TVtMy) 

Trt^{Om-i\y) 7it^_,{e^-2\y) T^tMy) 

END 

As is the case for standard importance sampling, AIS yields an independent sample 
{6i : i = 1, . . . , N} from the target distribution and this is a major strength of this approach. 
But also, by analogy to ([6]), it is possible to show that the average of the importance weights 
yield an estimator of the ratio of the normalising constants of the target distribution and 
the prior distribution, that is, an estimator of the evidence. 

1 ^ 

i=l 

AIS is very well suited to complicated posterior distributions which present difficulties 
for standard MCMC algorithms, in particular, the tempered aspect of the algorithm should 
help in situations where the posterior distribution is multi-modal. Note that from an im- 
plementation point of view, there are some key aspects to AIS. The choice of the transition 
kernel corresponding to the distribution at temperature tj is vital, and in practice one may 
need a few full sweeps of the parameters in order that the sequence 9m-i, ■ ■ ■ ,0i yields a rep- 
resentative draw 0*^*^ from the target distribution. In common with tempering algorithms, 
the choice of the temperature ladder tm, ■ ■ ■ ,to is vital. There has been some work in this 
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direction for the related tempered transition algorithm (Neal 1996), see (Behrens et al 2011). 
AIS is a popular method and is highly cited in the literature. It appears that it is primarily 
cited in the machine learning literature, and there is certainly much scope for it to apply in 
mainstream statistical settings. 

3.4.1 Running example: Gaussian model 

We first define the modified posteriors Titj{0\y) which in this case are 

T:t.{e\y) = T,{ef-'^T,{e\yY^ oc Tx{ef-'^ {'K{y\e)^{e)Y^ = 7r(0)7r(y|0)*^ (7) 

We must choose the transition kernel Tj with invariant iTt.. Here it is possible to choose a 
Gibbs kernel, since the full conditionals of /i and r at temperature tj are 



tj ^{yt - + bo 



t=l 



In implementing AIS, we would draw 9m-i = (/im-i, Tm-i)^ from the joint prior 7r(/i, r) = 
7r(/i)7r(r), then 9m-k would be sampled from 9m-k+i using Tm-fe+i , k = 2, . . . ,m. It will be 
necessary in practice to apply T^-fc+i a number of times to ensure that 9m-k is approximately 
distributed according to Tit . 

3.5 Nested sampling 



Nested sampling (Skilling 2006) is a generic algorithm for computing the evidence. The 
evidence may be viewed as the expectation, with respect to the prior, of the likelihood. As 
such, the evidence can be expressed as 

<y) = [ 7^{y\9)n{9) d9= [ 7r{y\9) dX, 



where dX = 7t{9) d9 is an element of prior mass. 
Define 



X(A) = / n{9) d9 

as a cumulant prior mass. Writing the inverse function as 7r(?/|X), that is, 7r(?/|X(A)) = A 
allows the evidence to be expressed as a 1— dimensional integral: 



<y)= ['niy\X)dX. 
Jo 



Suppose that we know how to evaluate the likelihood as U = 7r(?/|Xj) at a right-to-left 
sequence of / points < X/ < ■ ■ ■ < X2 < Xi < 1. Then any convenient quadrature method 
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would estimate the evidence as a weighted sum ^[^^ Wik. Since 7i{y\X) is non-increasing, it 
will be bounded below by any value evaluated at a larger value of X. Hence Wi = Xi — Xj+i 
with = gives a lower bound for estimating 7r(y). 

The main computational burden of nested sampling is the requirement to sample 6 from 
the prior subject to the constraint that vr(?/|6') > /, for value /. This is roughly similar to 



the computational cost of slice sampling (Neal 2003). The evidence is estimated by sorting 



draws from the prior according to their likelihood. 



i=l 



The nested sampling algorithm may be sketched as follows. 

Sample 6'^^), . . . , 6'^^) from the prior and initialize n{y) 
For i = 1 1 do: 



Find point 6i with the smallest likelihood, k, among the current 
^(^■)'s, j = l,...,iV. 

Set Xi = exp {-i/N} and Wi = Xi_i - X^. 
Increment Ti{y) by Uwi. 

Replace 9i with a point sampled from the prior subject to 'K{y\9) > k. 

END 



Note that the deterministic step Xi = exp{—i/N} could be replaced by a stochastic step. 
The algorithm may be terminated after a given number of iterations, when new contributions 
to 7r{y) are below some small fraction of the current value as suggested by Chopin and Robert 
(2010), or using one of the criteria outlined in (Skilling 2006). 

Nested sampling is a popular algorithm in the astronomy and related literatures. Yet it 
appears not to be widely used in mainstream statistics. Arguably, this is due to the com- 
putational overhead required to implement this algorithm. Note that Chopin and Robert 
(2010) illustrate, for the case where levels Xi follow a deterministic scheme, that the con- 
vergence rate is Oi^N^^"^). Their results hold where importance sampling is used to carry 
out the constrained prior sampling, and they note that similar results for an MCMC im- 
plementation of nested sampling are unknown. Finally Chopin and Robert (2010) illustrate 
that the algorithm performed well for a mixture model and a probit model, but that further 
experimentation is needed to determine situations where it may or may not work well. 



3.5.1 Running example: Gaussian model 

We would begin here by simulating 6'^*^ = (/i^*-*, r^*^)'^, z = 1,...,A^ from the prior and 
computing log7i(y\9'^^^) for each. Then 

Oi = arg mingg|g{i)_ g(iv)|log7r(y|6') 
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and li = 7r(?/|^i). 

We build up an approximation a to log7r(?/) by initializing this to the negative of the 
largest floating point number available in the machine on which computations are being 
carried out. Then a is incremented by li + log(l — exp{ — 1/iV}). We continue sampling from 
the prior until we obtain a 6 such that TT{y\0) > li, and replace 6i from the original sample 
from the prior with 9 and leaving all other 6'^*-' in the sample unchanged. The process is then 
repeated giving 

02 = arg mmg^^g(i) ^ g(N)^\ogTT{y\e) 

and I2 = 7!'{y\92), and a is incremented by I2 + log(exp{ — — exp{— 2/A^}). This is 
iterated until a new contribution to a is less than —8 log 10 + a, corresponding to a new 
contribution to the sum being less than 10~^ times the current value. 

An alternative to sampling from the prior until the constraint 7f {y\0) > li is satisfled, 
would be to choose one of the remaining sample 6 at random (which is already known 
to have vr(|/|^) > and move this a number of times subject to the constraint using a 
Metropolis-Hastings algorithm with target distribution 7r(^). An example would be 

Select 9 at random from the remaining sample not minimizing the likeli- 
hood. 

Draw 6' ~ qi-\0) from some proposal g(-) depending on 9. 

If TiiylO') < li, REJECT the proposal, otherwise accept with probability 

mmll/W; 



Repeat this process a given number of times until the dependency on 9 is 

FORGOTTEN. 

When we terminate the algorithm we add an end correction to a which should be negli- 



gible. See (Skilling 2006) and the pseudocode in the appendix there for details. 



3.6 Power posteriors 

In statistical physics, there is a large body of work concerned with methods for estimating 
normalising constants, or partition functions of statistical models. A notable example is the 
Ising model. The method of thermodynamic integration was developed in the statistical 
physics literature to handle such problems, and an excellent review of this approach, and 



other related methods, from a statistics perspective can be found in (Gelman and Meng 



1998). Friel and Pettitt (2008) explored how thermodynamic integration could be used 
for the speciflc instance where the normalising constant is the evidence arising from an 
un-normalised posterior distribution. Speciflcally they consider what they term the power 
posterior, 

7T{e\y,t)^{7Tiy\e)Y7T{e), 
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where t G [0, 1]. By construction the normahsing constant of the power posterior is 



z{y\t)= {n{y\9)y7^{9) d9, 
Je 

where z{y\t = 1) is the model evidence and z{y\t = 0) is the integral of the prior for 6, which 
equals 1. 

The model evidence follows the identity: 



z(y\t = l) 



log7r(|/) = log|^g^— = Eeit\og7r{y\e)dt. (9) 

To prove this one can show that the gradient of the log normalising constant can be expressed 
as the expected posterior deviance. 

.o.(.e#))^d. 

E0\t\og{7r{y\e)). (10) 



Integrating (10) with respect to t yields ([9]). In practice an estimator based on ^ is formu- 
lated by discretising t E [0, 1], = to < ^i, • • • , = 1- For each tj, a sample from n{9\y, tj) 
can be used to estimate Ej = E^it . log 7r(?/|6'). Finally, a trapezoidal rule can be used to 
approximate 

logvr(y)^fj(t,-VO(^^^y^) (11) 
Discretising the temperature introduces an error. The other source of error is the Monte 



Carlo error arising from approximating Ee|f^ log7r(?/|^^). Calderhead and Girolami (2009) 
have shown that the discretisation error depends on the Kullback-Liebler distance between 
Titj and vTi^+i, for j = 0, ... ,m — 1. Thus for a fixed number of temperatures, the optimal 
positioning of the {tj}, in terms of minimising the error due to the temperature discretisation, 
should be so that the Kullback-Liebler distance between successive tempered distributions 
is minimised. 

The power posterior approach is a generic method for estimating the evidence, and is 
relatively easy to implement. It is also often possible to implement it in WinBUGS. However, 
an immediate difficulty with this approach is that of choosing the temperature schedule, and 



this could be viewed as a weakness, although Behrens et al. (2011) offer some possibilities 
in this direction. 
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3.6.1 Running example: Gaussian model 

Here we may sample within a given temperature using the Gibbs sampler in ([s]). After 
generating samples 9^^\ . . . , 9^^^ within temperature tj we estimate 

N 

i?, ^5^1og7r(|/|^»). 

i=l 

To get starting values for the chain at temperature tj+i we use the average of the sampled 
values at tj, 

H's AT Z^i = l 's AT Z^2=l 



We sample for all temperatures tj and then estimate the log evidence using (11). 



3.7 Other approaches 

This article has focused mostly on evidence estimation methods based on MCMC sampling. 
However other approaches are also possible. Indeed evidence estimation is also possible 
using sequential Monte Carlo (SMC) methods, which is not surprising given that it is close 



in spirit to tempering methods. See Section 4 of (Del Moral et al 2006), where the evidence 



is estimated for a mixture model using SMC and AIS. For this example the authors conclude 
that SMC does not give superior performance to AIS. Note also that the authors argue that 
the issue of optimal temperature schedule (or path) in sequencing from, for example, prior to 
posterior is an important, but very challenging problem. This issue is clearly also of relevance 
to AIS and to the power posterior approach. Variational Bayes methods have been used to 
estimate the evidence for Ising models (Parise and Welling 2007), for example. Finally, 



we alert the reader to the fact that the integrated nested Laplace approximation (INLA) 



framework (Rue et al 2009) could potentially be useful as a means to provide an estimate of 



the evidence for the class of statistical models which can be represented as Gaussian Markov 
field models. This includes a wide class of models from auto-regressive time series models 
to stochastic volatility models. INLA software (www.r-inla.org/) gives the possibility to 



analyse such models providing an estimate of the model evidence. See for example, (Wyse 



et al 2011) in the context of change-point models with dependence between observations. 



4 Some examples 

Here we provide a brief numeric comparison of the methods reviewed in Section |3] on two 
full examples. The first of these compares two simple Gaussian linear non-nested regression 
models. The prior assumptions here lead to analytically tractable models and thus exact 
evaluation of the evidence and Bayes factor between the models, giving a benchmark for 
comparison of the methods in Section |3j The second example applies each of the methods 
to computing the evidence for competing logistic regression models. In this case there is no 
analytic tractability and the evidence must always be estimated via Monte Carlo sampling 
or other approximate techniques. 
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4.1 Non- nested linear regression models 

This data describes the maximum compression strength parallel to the grain density Xj 
and density adjusted for resin content Zi for n = 42 specimens of radiata pine. This data 



originates from (Williams 1959). It is wished to determine whether the density or resin- 
adjusted density is a better predictor of compression strength parallel to the grain. With 
this in mind, two Gaussian linear regression models are considered; 

Model 1: i/i = a + f3{xi — x) + ej, ej ~ A^(0, t~^), i = 1, . . . ,n, 
Model 2: yi = + S{zi - z) + rn, rji N{0, \~^), i = l,...,n. 

The priors assumed for the line parameters {a, ^^"^ (7; mean (3000, 185)"^ with 

precision (inverse variance-covariance) tQq and XQq respectively where Qo = diag(ro,so). 
The values of tq and sq were taken to be 0.06 and 6. A gamma prior with shape Oq = 6 
and rate = 4 x 300^ was taken for both r and A. These prior assumptions give rough 



equivalence with the priors assumed for this data in other analyses. See for example (Friel 



and Pettitt 2008[ ). 

It is possible to compute the exact marginal likelihood for both of these models due to 
the prior assumption that the precision on the mean of the regression line parameters is 
proportional to the error precision. For example, the marginal likelihood of Model 1 is given 
by 

^(y) _ -n/2,ao/2 r{(n + ao)/2} \Qo\^/'' (,+,„)/2 

where y = {yi, . . . , y^Y, M = X^X + Qo and R = I - XM'^X^ with the row of X 
equal to (1 Xi) and I is the 2x2 identity matrix. 

The exact value of the Bayes factor of Model 2 over Model 1 is given in Table [2] along 
with the Laplace approximation to this and mean and standard deviation of the estimate 
computed over eighteen runs of the Monte Carlo and nested sampling algorithms. For the 
Laplace approximation, the mode was located by using a standard Newton scheme. Out of 
interest we also computed the Laplace approximation at the MAP (maximum a posteriori) 
value of a simulation from the posterior. A series of boxplots illustrating the estimated 
log evidences and corresponding Bayes Factors computed over the eighteen runs is given in 
Figure [1} 

To make the each of the approaches for computing the evidence as comparable as possible, 
we tried to make each of the algorithms equivalent in terms of the number of iterations. For 
the Laplace approximation at the MAP and the harmonic mean estimator, results were based 
on a Gibbs sampling run of 505,000, taking 20% of this as a burn-in. For Chib's method, 
a Gibbs run sampling all the model parameters was run for 55,000 burn-in iterations with 
a subsequent 150,000 iterations before fixing the precision at it's modal value. Following 
this a run of 150,000 iterations for each of the regression line parameters in turn was carried 
out, fixing the previous line parameters at their modal value. For annealed importance 
sampling 1000 samples were generated from the posterior taking a geometric temperature 
ladder = (1 — z/100)^,z = 0, . . . , 100. Within each step of the ladder we sample five 
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Log evidence Model 1 Log evidence Model 2 Bayes Factor of Model 2 over Model 1 




Lap MAP HME CDib AIS Nesled Power posL Lap MAP HME CDib AIS Nesled Power posl Lap MAP HME Chib AIS Nesled Power posl 



Figure 1: Boxplots of evidence models 1 & 2 (left and middle panel) and computed Bayes 
factors (right panel) over eighteen runs of the algorithms of Section |3} The true value is 
shown by the dotted line. 



times from each tempered posterior to ensure a representative sample. Power posteriors 
used an identical (reversed) temperature ladder ti = (i/lOO)^,^ = 0, . . . , 100 to annealed 
importance sampling. Within each step of the ladder we ran the Gibbs sampling scheme for 
5000 iterations using 20% of this as a burn-in. Finally for nested sampling we terminated the 
algorithm when a new contribution to the nested sampling approximation of the evidence 
was less than 10~^ the current value as used by Chopin and Robert (2010). Fortunately, 
for the models in question, the full conditionals of the parameters was available even for the 
tempered posteriors. This was exploited by using Gibbs kernels in all of the Monte Carlo 
experiments. 

In this simple experiment it seems that Chib's method performs very well. Of course 



Method 


mean(i?F2i) 


S.E.(5F2i) 


Exact 


4553.65 




Laplace approximation 


4553.63 




Laplace approximation MAP 


4553.74 


1.05 


Harmonic mean estimator 


3827.47 


768.31 


Chib's method 


4553.69 


0.66 


Annealed importance sampling 


4597.89 


181.33 


Nested sampling 


5369.71 


3874.79 


Power posteriors 


4556.36 


66.90 



Table 2: Comparison of different approaches to estimating the Bayes factor of Model 2 over 
Model 1 based on eighteen runs of each algorithm for the Radiata Pine data. 
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this method assumes that the full conditionals are readily available and easy to sample from. 
The poorest performances are from the harmonic mean estimator and nested sampling, 
especially when looking at the standard error of the Bayes factor. Issues with the harmonic 
mean estimator have been mentioned earlier. It is possible that the performance of the nested 
sampling algorithm could be improved by better simulation from constrained distribution at 
high values of the likelihood, or maybe a stricter termination criterion. With the criterion 
used however, run times could still be time consuming. Chopin and Robert (2010) have 
proposed nested importance sampling, and suggested more efficient ways to simulate from 
the constrained likelihood distributions. This may alleviate some of the issues encountered 
here. It should be kept in mind that this example is quite straightforward. The simpler 
methods presented here may not work so well for more elaborate models. 

The results presented here should be interpreted as a rough guide to accuracy of the var- 
ious methods. We acknowledge that there are various ways in which some of the algorithms 
could be optimized, for example the tempering scale in annealed importance sampling and 
power posteriors, or the constrained sampling step in the nested sampling algorithm. We have 
employed by-and-large vanilla schemes here. For complete transparency, we have made the 
code available which was used to obtain these results at www.ucd. ie/ statdept/jwyse/Evidence . zip, 
including some documentation explaining the inputs to the code. We invite readers to ex- 
periment with different specifications to the algorithms and get a feel for the sensitivity of 
the results to prior assumptions. 



4.2 Choosing between two logistic regression models 

Here we examine the Pima Indians data which records instances of diabetes and a range of 
possible diabetes indicators for n = 532 Pima Indian women aged 21 years or over. There are 
seven potential predictors of diabetes recorded for this group; number of pregnancies (NP); 
plasma glucose concentration (PGC); diastolic blood pressure (BP); triceps skin fold thickness 
(TST); body mass index (BMI); diabetes pedigree function (DP) and age (AGE). This gives 129 
potential models (including a model with only a constant term). Diabetes incidence (y) is 
modelled by the likelihood 

n 
i=l 

where the probability of incidence for person i, pi, is related to the covariates (including 
constant term) Xi = (1, Xn, . . . , Xid)^ and the parameters 9 = {Oq, 9i, . . . , 6dY by 

where d is the number of explanatory variables. An independent multivariate Gaussian prior 
is assumed for the elements of 6*, so that 
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Method 


log 7r(y| Model 1) 


log 7r(y| Model 2) 


BFi2 


Relative speed 


Laplace approximation 


-257.26 


-259.89 


13.94 


1 


Chib & Jeliazkov's method 


-257.23 


-259.84 


13.66 


44 


Laplace approximation MAP 


-257.28 


-259.90 


13.77 


108 


Harmonic mean estimator 


-279.47 


-284.78 


203.12 


108 


Power posteriors 


-257.98 


-260.59 


13.71 


184 


Annealed importance sampling 


-257.87 


-260.43 


12.83 


194 


Nested sampling 


-258.82 


-261.38 


12.99 


808 



Table 3: Estimated log marginal likelihoods for each model and corresponding Bayes Factors 
for each method along with relative run times with r = 0.01. 



The covariates were standardized before analysis. 



A long reversible jump run (Green 1995) revealed that the two models with the highest 
posterior probability were 

Model 1: logit(p) = 1 + NP + PGC + BMI + DP 

and 

Model 2: logit(p) = 1 + NP + PGC + BMI + DP + AGE . 

This reversible jump algorithm assumed a non- informative value of r = 0.01 for the prior on 
the regression parameters. For this value of r we carried out a reduced reversible jump run 
restricting to jumps only between these two models. The prior probabilities of the models 
were adjusted to allow for very frequent jumps (about 29%). This gave a Bayes factor BF12 
of 13.96 which will be used as a benchmark to compare the other methods to. 

We compared each of the methods in Section [3] in estimating BF12. The results are shown 
in Table |3] ranked according to the speed of each of the methods. The results given for each 
of the methods were the best approximation to BF12 over five runs of the experiment. It 
was not possible to carry out an extensive simulation study here due to time constraints. 
The approximate relative speed is calculated by taking the average execution times of each 
of the algorithms over five runs of the experiment. 

Some measures were taken to try and make the implementations of each scheme as fair as 
possible for comparison in terms of speed, although we do not claim that our implementations 
are optimal. The code is available from www.ucd. ie/statdept/jwyse/Evidence .zip. Each 
Monte Carlo method used the equivalent of 200,000 samples. For example, the power poste- 
riors use 20,000 samples at each of 10 steps. The annealed importance sampling algorithm 
uses a tempering ladder of 101 to generate 2000 independent samples from the posterior. The 
Laplace at MAP and harmonic mean estimator both use 200,000 samples. Nested sampling 
uses 2,000 samples and is terminated when a new contribution to the approximation to Tc{y) 
is less than 10~® times the current value. 

As there are no full conditionals available for the tempered posteriors in this instance, 
some care is needed as to how to scale proposals within different (inverse) temperatures. It 
is desired to have wider proposals at lower values of t so that the algorithm can freely explore 
the support of the posterior. For this reason, when updating 6j in temperature t we chose a 
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Method 


log 7r(y| Model 1) 


log 7r(y| Model 2) 


BFi2 


Relative speed 


Laplace approximation 


-247.33 


-247.59 


1.31 


1 


Chib & Jeliazkov's method 


-247.31 


-247.58 


1.32 


40 


Laplace approximation MAP 


-247.33 


-247.62 


1.34 


98 


Harmonic mean estimator 


-259.84 


-260.55 


2.03 


98 


Power posteriors 


-247.57 


-247.84 


1.31 


169 


Annealed importance sampling 


-247.30 


-247.59 


1.33 


178 


Nested sampling 


-246.82 


-246.97 


1.15 


610 



Table 4: Estimated log marginal likelihoods for each model and corresponding Bayes Factors 
for each method along with relative run times with r = 1. 



proposal from a Gaussian distribution which is centered at 6j and with standard deviation 
(t^Tp)"^/^. Here we chose Tp = 2 and the value of a is so that the variability of the proposal 
near zero temperature would equal that of the prior. In the case of power posteriors, this 

was 

a = log(r/rp)/log(ti). 

The most notable performances from Table |3] are those of the harmonic mean estimator 
and nested sampling. Something which is of note here is the sensitivity of the Bayes factor 
to the value of r. For example, increasing r to 1, and hence being more informative about 
the variability of parameter values, reduces this Bayes factor to about 1.3 which would lead 
to a much weaker conclusion as to which was the better model for the data. When the 
experiment is run with r = 1 we note much better performance of all the algorithms in 
estimating BF12 (Table |4]). A value of -B-F12 = 1.3 was obtained from a long reversible jump 
run. 

5 Concluding thoughts 

The evidence is a fundamental quantity in Bayesian statistics. This short article has sur- 
veyed some widely used approaches in the literature and hopefully has encouraged interested 
readers to explore these methodologies further. The evidence is generally a difficult quantity 
to estimate, especially if the prior distribution is diffuse, and as such, it is not unsurprising 
that it may take some effort to implement, in terms of computational run time and computer 
coding. Often, the easiest to implement method may not give the most reliable estimates. 
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